An instructive example using Stan…
First, some setup: expose the Stan quantile functions and load the GLD helper routines…
rstan::expose_stan_functions("quantile_functions.stan")
source("GLD_helpers.R")
summary(cps$rw)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   1.079  10.794  16.299  20.056  25.690 188.898   13322
NB I am recoding the “wbho” variable (race) into a Boolean White/Non-White 1/0 indicator.
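A minimal sketch of that recoding (the level name "White" and the 1 = White direction are my assumptions; the actual CPS extract may code the race levels, and the indicator's direction, differently):

```r
# Toy stand-in for the CPS "wbho" factor (White / Black / Hispanic / Other).
wbho <- factor(c("White", "Black", "Hispanic", "White", "Other"))

# Recode to a 1/0 indicator: 1 = White, 0 = Non-White (direction assumed).
white <- as.integer(wbho == "White")
white # 1 0 0 1 0
```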
Now, using the theoretical model above - where I think that one’s gender, race, marital status, and age (in decades) have a bearing on one’s hourly wage, plus some intercept and an error term - let’s construct a prior predictive distribution…
set.seed(42)
# y = m*x + c
# y = alpha + beta*x
# b1/Female = cps$female, b2/Race = cps$wbho, b3/Married = cps$married, b4/Age = cps$age_dec
# Setting up my priors...
# NB Median should be about equidistant between the lower and upper quartiles.
(a_s1 <- GLD_solver(lower_quartile = -0.3, median = -0.2, upper_quartile = -0.1, # GLD priors for coefficients.
                    other_quantile = 0.2, alpha = 0.9))
## asymmetry steepness
## 0.0000000 0.9653298
(a_s2 <- GLD_solver(lower_quartile = -0.01, median = 0, upper_quartile = 0.01,
                    other_quantile = 0.03, alpha = 0.9))
## asymmetry steepness
## 0.0000000 0.9283763
(a_s3 <- GLD_solver(lower_quartile = 0.3, median = 0.5, upper_quartile = 0.65,
                    other_quantile = 0.85, alpha = 0.9))
##  asymmetry  steepness
## -0.4226581  0.9033965
(a_s4 <- GLD_solver(lower_quartile = 0, median = 0.2, upper_quartile = 0.3,
                    other_quantile = 0.5, alpha = 0.9))
##  asymmetry  steepness
## -0.6771304  0.9722460
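As a quick check of the equidistance note above: the first two priors have a median exactly midway between the quartiles, which is why `GLD_solver` reports zero asymmetry for them, while the last two do not (base R only):

```r
# Gap difference: 0 when the median sits midway between the quartiles.
sym <- function(lower, median, upper) (median - lower) - (upper - median)
c(prior1 = sym(-0.3, -0.2, -0.1),
  prior2 = sym(-0.01, 0, 0.01),
  prior3 = sym(0.3, 0.5, 0.65),
  prior4 = sym(0, 0.2, 0.3))
```

The nonzero gap differences for priors 3 and 4 line up with the nonzero asymmetry estimates returned above.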
m_alpha <- log(16); s_alpha <- 0.1 # Normal prior for alpha: the expected log-wage at average values of the (centered) predictors.
q_sigma <- c(lower = 0.25, median = 1, upper = 2)
d <- model.matrix(log(rw) ~ female + wbho + married + age_dec, # Building a model matrix from our data.
                  data = cps)[, -1] # Drop the intercept column.
d <- sweep(d, MARGIN = 2, STATS = colMeans(d), FUN = `-`) # Center each column at zero.
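The `sweep()` call is what lets the intercept prior be stated at average covariate values: it subtracts each column mean, so every centered column averages zero. A standalone illustration on a toy matrix (not the CPS data):

```r
m <- cbind(a = c(1, 2, 3, 6), b = c(10, 20, 30, 40))
centered <- sweep(m, MARGIN = 2, STATS = colMeans(m), FUN = `-`)
colMeans(centered) # Both columns are now (numerically) zero-mean.
```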
ppd <- replicate(1000, {
  alpha_ <- rnorm(1, mean = m_alpha, sd = s_alpha)
  sigma_ <- JQPDS_rng(lower_bound = 0, alpha = 0.25, quantiles = q_sigma) # Semi-bounded distribution for sigma.
  beta1_ <- GLD_rng(median = -0.2, IQR = -0.1 - (-0.3), asymmetry = a_s1[1], steepness = a_s1[2]) # IQR = upper - lower.
  beta2_ <- GLD_rng(median = 0, IQR = 0.01 - (-0.01), asymmetry = a_s2[1], steepness = a_s2[2])
  beta3_ <- GLD_rng(median = 0.5, IQR = 0.65 - 0.3, asymmetry = a_s3[1], steepness = a_s3[2])
  beta4_ <- GLD_rng(median = 0.2, IQR = 0.3 - 0, asymmetry = a_s4[1], steepness = a_s4[2])
  mu_ <- alpha_ + beta1_ * d[, "female"] + beta2_ * d[, "wbho"] +
    beta3_ * d[, "married"] + beta4_ * d[, "age_dec"]^2
  epsilon_ <- rnorm(n = length(mu_), mean = 0, sd = sigma_) # The error term.
  y_ <- mu_ + epsilon_ # y_ is normally distributed with expectation mu_.
  y_
})
round(quantile(c(exp(ppd)), probs = 1:9 / 10), digits = 3)
##     10%     20%     30%     40%     50%     60%     70%     80%     90%
##   0.796   4.699   9.661  14.476  19.478  25.829  37.660  69.993 248.865
This gives an unreasonable distribution at the first and ninth deciles (too low and too high, respectively). We would need to change the priors, include more variables in the model, or both.
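Part of that heavy upper decile is baked in by exponentiation: a modest spread on the log scale becomes a long right tail on the dollar scale. A standalone base-R sketch (a plain normal on the log scale stands in for the full prior machinery above; the sd of 0.8 is an illustrative guess at the implied total spread):

```r
set.seed(42)
log_wage <- rnorm(1e5, mean = log(16), sd = 0.8) # Illustrative log-scale spread.
round(quantile(exp(log_wage), probs = c(0.1, 0.5, 0.9)), 1)
```

On the log scale these quantiles are symmetric around log(16), but after exponentiating, the 90th percentile is nearly three times the median while the 10th is only about a third of it.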
post <- stan_lm(log(rw) ~ female + wbho + married + I(age_dec ^ 2), data = cps,
                prior_intercept = normal(location = m_alpha, scale = s_alpha, autoscale = FALSE), # Location from m_alpha.
                prior = R2(location = median(r_squared), what = "median"), adapt_delta = 0.99) # Location obtained from my R^2 above.
print(post, digits = 4)
## stan_lm
## family: gaussian [identity]
## formula: log(rw) ~ female + wbho + married + I(age_dec^2)
## observations: 14799
## predictors: 5
## ------
##              Median  MAD_SD
## (Intercept)   2.8281 0.0169
## female       -0.3055 0.0089
## wbho         -0.1041 0.0117
## married       0.2311 0.0093
## I(age_dec^2)  0.0077 0.0005
##
## Auxiliary parameter(s):
##               Median MAD_SD
## R2            0.1483 0.0054
## log-fit_ratio 0.0000 0.0057
## sigma         0.5493 0.0032
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
loo(post) # Pareto smoothed importance-sampling leave-one-out cross-validation (PSIS-LOO).
##
## Computed from 4000 by 14799 log-likelihood matrix
##
##          Estimate    SE
## elpd_loo -12138.9  98.6
## p_loo         6.6   0.2
## looic     24277.7 197.2
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
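A small arithmetic note on the loo output: looic is simply elpd_loo on the deviance scale, i.e. -2 × elpd_loo, as the printed values confirm up to rounding:

```r
elpd_loo <- -12138.9 # From the loo() printout above.
-2 * elpd_loo # 24277.8, matching the printed looic of 24277.7 up to rounding.
```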
pp_check(post, plotfun = "loo_pit_overlay")
## Running PSIS to compute weights...
We see that the model - as is - decidedly misfits in the lower tail, though it otherwise seems to fit reasonably well.
##   (Intercept)        female          wbho       married       age_dec
##          TRUE         FALSE         FALSE          TRUE          TRUE
##         sigma log-fit_ratio            R2
##          TRUE         FALSE          TRUE
The differences are somewhat predictable in direction, perhaps, though not so much in magnitude.
library(bayesplot)
mcmc_areas_ridges(post, regex_pars = "^[^(]") # Excluding (Intercept).
# I'm interested in the variables and mean PPD...
pairs(post, regex_pars = "^[^(lRs]") # Exclude parameters that start with (, l, R, or s. This takes care of (Intercept), sigma, log-fit_ratio, R2, and log-posterior.
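The `regex_pars` patterns work because parameter names are matched against regular expressions: `"^[^(lRs]"` keeps any name whose first character is not `(`, `l`, `R`, or `s`. Checking the pattern against the parameter names printed above (base R only):

```r
pars <- c("(Intercept)", "female", "wbho", "married", "I(age_dec^2)",
          "sigma", "log-fit_ratio", "R2", "log-posterior")
pars[grepl("^[^(lRs]", pars)] # "female" "wbho" "married" "I(age_dec^2)"
```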
Originally, the model yielded a few divergent transitions below the diagonal, indicating - empirically - that the sampler was struggling with the tails of the posterior, though we knew that much already from the overlaid densities. In that case it might have been tempting to change the model in order to simplify the geometry of the posterior distribution, since dealing with divergent transitions below the diagonal is somewhat oblique. Instead, by raising the target acceptance rate, adapt_delta, from its default of 0.8 to 0.99, I managed to take care of that particular problem without having to amend the model or my priors.